Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022), from 125M to 175B parameters, on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model size, a similar subset of training tokens sees the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training compute.
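For reference, the perplexity tracked throughout this analysis is just the exponentiated average per-token negative log-likelihood; a minimal sketch (variable names are illustrative, not from the paper):

```python
import math

def perplexity(token_nlls):
    """Perplexity of a sequence given per-token negative log-likelihoods (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Example: an average loss of ~3.2 nats/token corresponds to a perplexity of ~24.5
print(perplexity([3.0, 3.4, 3.2, 3.2]))
```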
Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality, among other traits, by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human-annotated and six programmatically perturbed diagnostic datasets covering a diverse set of tasks that require reasoning skills, and show that ROSCOE consistently outperforms baseline metrics.
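As an illustration of the general idea of embedding-based evaluation of step-by-step rationales (this is not the actual ROSCOE implementation; the encoder and scoring rule below are placeholder choices), one could score how well each reasoning step aligns with the source text:

```python
# Illustrative sketch only: a simple embedding-based step-to-source alignment score
# in the spirit of reasoning-chain evaluation. NOT the ROSCOE metrics themselves.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do here

def step_alignment(source: str, steps: list[str]) -> float:
    """Average cosine similarity of each reasoning step to the source text."""
    src_emb = model.encode([source], convert_to_tensor=True)
    step_emb = model.encode(steps, convert_to_tensor=True)
    sims = util.cos_sim(step_emb, src_emb)  # shape: (num_steps, 1)
    return sims.max(dim=1).values.mean().item()

score = step_alignment(
    "A train travels 60 miles in 1.5 hours.",
    ["Speed is distance divided by time.", "60 / 1.5 = 40 miles per hour."],
)
```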
Existing language models (LMs) predict tokens with a softmax over a finite vocabulary, which can make it difficult to predict rare tokens or phrases. We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus. We show that NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval. Zero-shot evaluation on 9 closed-set tasks and 7 open-set tasks demonstrates that NPM outperforms significantly larger parametric models, either with or without a retrieve-and-generate approach. It is particularly effective at dealing with rare patterns (word senses or facts) and at predicting rare or nearly unseen words (e.g., non-Latin script). We release the model and code at github.com/facebookresearch/NPM.
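A rough sketch of the nonparametric prediction step, in which a masked-position query is scored against every candidate phrase in a reference corpus rather than against a fixed-vocabulary softmax (the encoders and data below are stand-ins, not NPM's):

```python
# Minimal sketch of nonparametric prediction: rank corpus phrases by similarity to
# the masked-position query vector. Embeddings here are random placeholders; in NPM
# they would come from trained token/phrase encoders.
import numpy as np

def nonparametric_predict(query_vec, phrase_vecs, phrases, k=5):
    """Return the k corpus phrases whose embeddings are most similar to the query."""
    sims = phrase_vecs @ query_vec / (
        np.linalg.norm(phrase_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    top = np.argsort(-sims)[:k]
    return [(phrases[i], float(sims[i])) for i in top]

phrases = ["Thessaloniki", "Athens", "a large city"]
phrase_vecs = np.random.randn(len(phrases), 64)   # stand-in for phrase encodings
query_vec = np.random.randn(64)                   # stand-in for the masked-position query
print(nonparametric_predict(query_vec, phrase_vecs, phrases, k=2))
```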
Many recent approaches to natural language tasks build on the remarkable capabilities of large language models. Large language models can perform in-context learning: they learn a new task from just a few task demonstrations, without any parameter updates. This work examines the implications of in-context learning for dataset creation for new natural language tasks. Departing from recent in-context learning methods, we formulate an annotation-efficient, two-step framework: selective annotation, which chooses a pool of examples to annotate from unlabeled data in advance, followed by prompt retrieval, which retrieves task examples from the annotated pool at test time. Based on this framework, we propose vote-k, an unsupervised, graph-based selective annotation method that selects diverse, representative examples to annotate. Extensive experiments on 10 datasets (covering classification, commonsense reasoning, dialogue, and text/code generation) show that our selective annotation method improves task performance by a large margin. On average, vote-k achieves a 12.9%/11.4% relative gain over randomly selecting examples to annotate under the same annotation budgets. Compared to state-of-the-art supervised fine-tuning approaches, it yields similar performance with 10-100x lower annotation cost across the 10 tasks. We further analyze the effectiveness of our framework in various scenarios: language models of different sizes, alternative selective annotation methods, and cases with test-time domain shift. We hope our study will serve as a basis for data annotation as large language models are increasingly applied to new tasks. Our code is available at https://github.com/hkunlp/icl-selective-annotation.
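A simplified sketch of graph-based diverse example selection in the spirit of vote-k (this is not the paper's exact algorithm; the neighborhood size and discount rule below are illustrative choices):

```python
# Greedy graph-based selection: examples "vote" for their nearest neighbors, and the
# neighborhood of each picked example is down-weighted to encourage diversity.
import numpy as np

def select_diverse(embeddings: np.ndarray, budget: int, k: int = 10, discount: float = 0.5):
    """Pick `budget` diverse, representative examples from an embedding matrix."""
    n = len(embeddings)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    neighbors = np.argsort(-sims, axis=1)[:, 1 : k + 1]  # k nearest neighbors per example
    weights = np.ones(n)
    selected = []
    for _ in range(budget):
        votes = np.zeros(n)
        for i in range(n):
            votes[neighbors[i]] += weights[i]     # each example votes for its neighbors
        votes[selected] = -np.inf                 # never re-pick an example
        pick = int(np.argmax(votes))
        selected.append(pick)
        weights[neighbors[pick]] *= discount      # discount the pick's neighborhood
        weights[pick] *= discount
    return selected

picks = select_diverse(np.random.randn(200, 32), budget=5)
```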
Due to the high performance and safety requirements of self-driving applications, the complexity of modern autonomous driving systems (ADS) has been growing, spurring the need for more sophisticated hardware, which can increase the energy footprint of the ADS platform. To address this problem, edge computing promises to encompass autonomous driving applications, enabling compute-intensive autonomy tasks to be processed by computationally capable edge servers. However, the intricate hardware architecture of ADS platforms, in addition to their stringent robustness requirements, introduces task offloading complications unique to autonomous driving. We therefore present ROMANUS, a methodology for robust and efficient task offloading on modular ADS platforms with multi-sensor processing pipelines. Our methodology entails two phases: (i) introducing efficient offloading points along the execution path of the relevant deep learning models, and (ii) implementing a runtime solution based on deep reinforcement learning that adapts the operation mode according to variations in perceived road-scene complexity, network connectivity, and server load. Experiments on an object detection use case show that our approach is 14.99% more energy-efficient than pure local execution while reducing risky behavior by 77.06% compared to a robustness-agnostic offloading baseline.
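A toy sketch of the first phase, choosing an offloading point along a model's execution path by comparing estimated latencies (the cost model and numbers are invented for illustration; ROMANUS itself adapts this decision at runtime with deep reinforcement learning):

```python
# Toy cost model for picking a split point: run layers [0, split) locally, transmit
# layer `split`'s input, and run layers [split, n) on the edge server.
def best_offload_point(local_ms, remote_ms, upload_ms):
    """local_ms[i]/remote_ms[i]: per-layer latency; upload_ms[i]: cost of sending layer i's input."""
    n = len(local_ms)
    best, best_latency = n, sum(local_ms)  # default: execute everything locally
    for split in range(n):
        latency = sum(local_ms[:split]) + upload_ms[split] + sum(remote_ms[split:])
        if latency < best_latency:
            best, best_latency = split, latency
    return best, best_latency

# Illustrative numbers only (milliseconds)
print(best_offload_point([5, 8, 12, 20], [1, 2, 3, 5], [30, 12, 6, 4]))
```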
Language models demonstrate both quantitative improvements and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities remain poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and mitigate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, commonsense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to billions of parameters. In addition, a team of human expert raters performed all tasks to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; and social bias typically increases with scale in settings with ambiguous context, but this can be improved through prompting.
Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3 while requiring only 1/7th the carbon footprint to develop. We also release our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.
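One common way to experiment with the released checkpoints is through the Hugging Face hub (a generic usage sketch, not the paper's own release tooling):

```python
# Load a small released OPT checkpoint and generate a continuation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tokenizer("Scaling laws suggest that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```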
Mixture-of-experts (MoE) layers enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models compare with dense models across a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute-efficient. At more modest training budgets, MoEs can match the performance of dense models using ~4x less compute. The gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that merit further study. We make our code and models publicly available for research use.
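A minimal sketch of conditional computation with a top-2 gated mixture-of-experts feed-forward layer (dimensions and routing are simplified for illustration and are not the paper's exact architecture):

```python
# Each token is routed to only its top-2 experts, so compute grows sub-linearly
# with the number of experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.gate(x)                   # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):          # combine each token's top-k expert outputs
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

y = TinyMoE()(torch.randn(8, 64))
```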
Large autoregressive language models such as GPT-3 are few-shot learners that can perform a wide variety of language tasks without fine-tuning. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, which may limit their cross-lingual generalization. In this work, we train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages and study their few- and zero-shot learning capabilities across a wide range of tasks. Our largest model, with 7.5 billion parameters, sets a new state of the art in few-shot learning on more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (+7.4% absolute accuracy in the 0-shot setting and +9.4% in the 4-shot setting) and natural language inference (+5.4% in each of the 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 of 182 translation directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning on some tasks, while there remains room for improvement in surface-form robustness and in adapting to tasks that lack a natural cloze form. Finally, we evaluate our model on hate speech detection in five languages and find that it has limitations similar to comparably sized GPT-3 models.
We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT-large, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE. Our code and pre-trained models are available at https://github.com/facebookresearch/SpanBERT.
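A simplified sketch of contiguous span masking with geometrically distributed span lengths (parameter values are illustrative rather than the paper's exact configuration):

```python
# Mask contiguous spans of tokens, rather than individual tokens, up to a masking budget.
import random

def sample_span_length(p=0.2, max_len=10):
    """Geometric(p) span length, clipped at max_len."""
    length = 1
    while random.random() > p and length < max_len:
        length += 1
    return length

def mask_spans(tokens, budget=0.15, mask_token="[MASK]"):
    """Mask contiguous spans until roughly `budget` of the tokens are masked."""
    tokens = list(tokens)
    target = max(1, int(len(tokens) * budget))
    masked = set()
    while len(masked) < target:
        length = sample_span_length()
        start = random.randrange(0, max(1, len(tokens) - length))
        masked.update(range(start, start + length))
    return [mask_token if i in masked else tok for i, tok in enumerate(tokens)]

print(mask_spans("the quick brown fox jumps over the lazy dog".split()))
```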